
Experimental hall of the Wendelstein 7-X stellarator at IPP Greifswald (image: Wikimedia Commons)
Recently I handed in a follow-up refactoring I had promised a colleague in response to one of his review comments. The code in question picked a random month between now and some predetermined start month in the past, in order to run some processing of business data for that month. Reviewing the refactoring, he remarked that two variables had been put at the top level of the namespace: one for the start year, one for the start month. What bugged him, he said, was that while the start year made sense, the ability to configure the start month didn’t, since it wasn’t supported by the underlying math as coded.
Which I fully agreed with. I would never write code like this. Which brings us to the topic, namely that I didn’t. Claude did. Claude extracted the variables to the top level, laudable in itself as an action. The problem is that a programmer who later comes across the start-month variable might feel invited to change it, which would have made the unit tests fail. The reason there is no question that this is bad code is precisely the confusion that would ensue. Which one is right: the seemingly stated intent of a configurable start month, by virtue of that variable existing? Or the tests, which by common understanding are also taken as encoding intent? The problem is the obvious contradiction (it would be less confusing were the code very obscure, for example, as that would shift one’s attention to the tests).
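For concreteness, here is a hypothetical reconstruction of the shape of the problem; every name and value is invented, not the actual code. The window-counting arithmetic handles any start month, but the reconstruction of the picked month only counts correctly from January, so the extracted start-month variable is a trap:

```python
import random
from datetime import date

# Hypothetical reconstruction -- these constants mirror the extracted variables.
START_YEAR = 2020
START_MONTH = 1  # looks configurable, but the math below only honours January


def random_month(today=date(2025, 6, 15)):
    """Pick a random (year, month) between the start month and today."""
    # Number of candidate months in the window (correct for any START_MONTH).
    total = (today.year - START_YEAR) * 12 + (today.month - START_MONTH) + 1
    offset = random.randrange(total)
    # Bug: the reconstruction counts from January of START_YEAR, silently
    # ignoring START_MONTH. Any value other than 1 can yield months that
    # fall before the intended window -- and make the unit tests fail.
    year = START_YEAR + offset // 12
    month = offset % 12 + 1
    return year, month
```

With `START_MONTH = 1` the window happens to come out right, which is exactly why the bug stays hidden until someone takes the variable at its word and changes it.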
I think a discussion of this episode is quite important, given that it’s 2025, and from here on we can expect to see more code written (not only co-authored) by LLMs. I concur with many folks on the hype side that the more you do it, the more you start trusting the thing. I underwent what is probably a fairly typical transition from “I watch it like a hawk” to just letting it run. Especially since Opus 4.5.
But the fear is, of course, ending up with an unmaintainable mess, a Big Ball of Mud.
Is it justified, though?
I think in one way nothing really changes the judgment calls of the software engineer about what the desired structure, code quality, test coverage, etc. have to be. In the opening anecdote, for example, we can mention that the code was relatively isolated, was only a reporting feature, and was not likely to be built upon (and if it were, one could make the case that a pre-cleanup would have to take place before building the new thing, which is common practice).
Also, it was reviewed. And it is perfectly possible to ensure code gets reviewed with an eye to maintaining the desired qualities (this can, of course, also be translated into an LLM-assisted review process). Plus, I must say, in settings where many programmers touch the same codebase (without a primary maintainer, I suppose), I often can’t discern a clear code structure anyway. And the code is often of mixed quality.
I understand the intuition that nothing excuses bad code, but one perhaps also has to say that apart from pristine Donald Knuth algorithms there is never really such a thing as good code—only less bad code. It is also inexcusable in a corporate setting for developers to follow their own rabbit holes to endlessly sculpt their code to perfection, in pursuit of beauty, while their colleagues bear the load and work under the usual stresses and constraints of daily business. There is a blurry line between craftsmanship and narcissism here. What I’m getting at is that it is easy to react to LLM-generated code with a kind of prejudice which does not necessarily hold up under scrutiny. LLMs may produce sub-optimal code, but so do humans.
As a final point, besides the quality question, the other thing that interests me is whether the participation of AI agents in the process of writing code changes anything with regard to the question of Source of Truth,[1] specifically the relationship between tests and code. My gut feeling is that it probably does. Even though there is usually no 100% test coverage, I would consider tests in such a setting more important, the equation shifting in their favour as the source of truth. One argument for this is that tests, which as I said we can understand as stated intent, are probably easier to translate into another language than the code itself.
What that means concretely, if one chooses to follow that logic, is that one should watch the tests more attentively than the produced code. We might want to lean increasingly on judging the health of the codebase by the tests it passes (in addition to whatever standards or metrics we apply to ensure code quality). It is really important to make sure the LLM doesn’t delete tests for no reason, or change them so they succeed by reward hacking. But, on the other hand, if the tests pass, they pass. The functional requirements are met. If anything, this should induce us to write more tests, and to make sure those are structured to convey intent as clearly as possible.[2]
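For illustration, here is a minimal sketch of what intent-revealing tests might look like for month-picking code of the kind in the anecdote. Every name is hypothetical, and the implementation is only a stand-in so the tests have something to run against; the point is that the test names and the single named constant state the business rule outright, which is exactly what the stray start-month variable failed to do:

```python
import random
from datetime import date

# Hypothetical: the one business rule the tests pin down.
REPORTING_START = (2020, 1)


def pick_random_month(today=date(2025, 6, 15)):
    """Stand-in implementation under test (invented for this sketch)."""
    start_year, start_month = REPORTING_START
    total = (today.year - start_year) * 12 + (today.month - start_month) + 1
    offset = random.randrange(total)
    # Count months on an absolute scale so any start month works.
    year, month_zero_based = divmod(start_year * 12 + start_month - 1 + offset, 12)
    return year, month_zero_based + 1


def test_never_precedes_reporting_start():
    # The test name says the intent outright: no month before the window.
    random.seed(0)
    for _ in range(1000):
        assert pick_random_month() >= REPORTING_START


def test_never_exceeds_current_month():
    random.seed(0)
    for _ in range(1000):
        assert pick_random_month() <= (2025, 6)
```

Tests shaped like this read as a specification: a reviewer (or an LLM) can translate them into any language, or check the production code against them, without reverse-engineering intent from the implementation.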
To remind ourselves, hype and skepticism aside, the reason for even having this discussion is that there is a strong argument for employing LLMs to write code: you can undeniably write more of it. In my case, and please don’t read this as an excuse, since I willingly admitted to my “mistake” above, the pull request came to be after a week of working on a totally out-of-the-blue ad-hoc side quest that got thrown at me. And I was still able to finish all my remaining tickets, thanks to my agent ;) What I’m saying is: yes, the worries are justified, but the added leverage is also undeniable. We as engineers keep doing what we do, and continue to make all the necessary judgment calls expected of us, all the while just shipping more code.
Footnotes
[1] I wrote about that here: Source of Truth in SWE. link.
[2] At the end of the Source of Truth article I touched on the topic of post-hoc documentation. This can apply here as well. In fact, I always tended to write tests alongside (or even after) the code had been written, instead of the ideal test-first.